WORLD INTELLECTUAL PROPERTY ORGANIZATION 
International Bureau 




PCX 

INTERNATIONAL APPLICATION PUBUSHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(51) International Patent Classification ^ : 
G06F 12/00, 13/00 



Al 



(11) International Publication Nnmbcr: WO 97/07461 

(43) International Publication Date: 27 Febniary 1997 (27.02.97) 



(21) International Application Number: PCT/US96/ 13238 

(22) International Filing Date: 15 August 1996 (15.08.96) 



(30) Priority Data: 
08/51 6;293 



17 August 1995 (17.08.95) 



US 



(71) AppUcant: BORG TECHNOLOGIES, INC. [USAJS]; 1341 

Cannon Street. Louisville, CO 80027 (US). 

(72) Inventors: STAIXMO, David, C; 59 Beaver Way. Boulder. 

CO 80304 (US). HALL, Randy, K.; 400 Oneida Street, 
Boulder. CO 80303 (US). 

(74) Agent: YOUNG, James. R.; Suite 385, 12000 N. Washington 
Street, Denver, CO 80241 (US). 



(81) Designated States: CA. European patent (AT, BE, CH, DE, 
DK, ES, H, FR, GB. GR. IE, IT. LU, MC, NL, PT, SE). 



Published 

Wiih international search report. 



(54) Titie: METHOD AND APPARATUS FOR IMPROVING PERFORMANCE IN A REDUNDANT ARRAY OF INDEPENDENT 
DISKS 




(57) Abstract 

A RAID disk array (106. 110. 1 12. 114. 1 16) thai is adaptable to host I/O traffic, wherein the RAID configuration is hidden from 
the host computer (102). The system dynamically determines the RAID configuration used to store host data to maximize response time 
peifonnance and minimize the loss of disk space used for data protection. To maximize response time and avoid a write penalty, small 
write operations are mapped into RAID 1 configurations, and medium and large write operations are mapped into RAID 3 configiirations. 
These segments are migrated into RAID 5 configurations as a background operation, to minimize the disk space lost. The system hides 
configuration changes necessary for the addition and/or deletion of disks to the disk array. While these changes are in progress, the disk 
array (106, 1 10. 1 12. 114. 116) remains on-line and all host data is available for access and modification. 
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METHOD AND APPARATUS FOR ^^^L^^^^it?^ ^ 
REDUNDANT ARRAY OF INDEPENDENT DISKS 

TECHNICAL FIELD 
TWsinventionrelatestocompmersystemsand more particularly to disk 

5 computer systems. Even more particularly, the invention relates to a Redundant Array of 

Independent Disks (RAID) system. 

BACKGROUND OF THE INVENTION 

In a typical computer system, several disk devices are attached to a host computer. Data 
blocks are transferred between the host computer and each ofthe disks as appUcationprogramsi^ 

10 or write data from or to the disks. This data transfer is accomplished through a data I/O bus that 
connects the host computer to the disks. One such data I/O bus is calledasmall computer system 

interface (SCSI) bus and is commonly used on systems ranging in size from large personal 
computers to small mainframe computers. 

Although each drive amehed to the SCSI bus can store large amounB of data. Ae drives 
, 5 physMv catuK,. locaK and retrieve dau fast enough to match the speed of a larger host ptocessor. 
and *is HmitaUon creates an I/O bonleneC in the systen. To further aggravate .he problem, 
system configurations ftev«Mly dedicate on. drive .o one specific application. For example, m 
the Unix (tm) Operating System, a Unix file system can be no larger titan a single disk, and often 
a single disk is dedicated to a single file system. To improve performance, a particular file system 
20 maybededicatedtoeachapplicationbeingrun. Urus. each appUcaUon will access a different dtsk. 

improving performance. 

Disk arrays, often called redundant arrays of independent disks (RAID), allevmte this I/O 
bottleneck by distributing the I/O load ofasingle large drive across multiple smaller dnves^T^^ 

SCSI interface sends commands and data to the RAID system, and a controller within the RAID 
25 systemreceivesthecommandsanddata.delegatestaskstoindependentprocesseswithmthearT^^^ 
controllcand these independent processes address one or more ofthe independent disks attached 

to the RAID system to provide the data transfer requested by the host system. 

One way a RAID system can improve performance is by striping data. Striping of data is 
done by writing data from a single file system across multiple disks. This single file system still 
30 appears to the host system as a single disk, since the host system expects a single file system to be 
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U»widU,ofd,ediska™v S,ri«d^. ™'°^«'»'»»™P«->ion on each disk ac^ss 
read or „H,e ope.«„„ anU=ipa„d « ^ ^1^^^ ^' ■f-" 

-e^i:r=— — ^^^^^ - 

later further defined and expanded by an industrv of Cahforma at Berkeley and 

..^.dar::r~- - - „„ . 

«s«ped.„«„drtve.acco^<„,,„^.,j:~r;:™''°"^^'^'^""^ 

and «rit. op«ations u, be exec«,H 

P~v«« „o da. p:j^dr„ ~ *c ^ bu. „ 

P«*abi„^ofasi.s,edrtves,s.e„,,ai,I. rXor. rr'^'"'"'^*^ 
balancing b-, do=s „„, „ ""^'^'"''"""'^'"Sttransacdonrattsa^llMd 

access .TlT^. '"^ a disk and sn.se,nc„. ,oss or 

A RAID 1 configuration is sometimes callerfm,rw> • r ^. 
^ wnnen .o .o differ, drtves. *ns U» dll dZ^'' ^if^ 

-e.M.....ccas„.c.disks»^.sjr:ir^":,:t^^^ 

protection against U,e loss of a disk with „„ l„=. , ■ '^"'^ 
possible improvement taread tr, ''"'^'"">" =«' ' 

^provideleZ™ •"'-"'^'^'■-'-^---'-'^spac. 

O A RAID 2 configuration stripes data across the array of disks and ai^ 
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RAID 2 systems durhca-.c this redundancy infomation and require significantly more time and 
space to be cost cficcnvc. so they are seldom used. 

A RAID 3 c..n!u-urat,on implements a method for securing data by generating and storing 
parity data, and KM! » ' rr.n .Jcs a larger bandwidth for applications that process large files. In a 
5 RAID 3 confu.«rn....n r.r:,x dau are stored on a dedicated drive, requiring one drive's worth of 
data out of the ^n.^ . • ^Mvcs. in order to store the parity information. Because all parity 
information .s su.c 1 . Moglc drive, this drive becomes the I/O bottleneck, since each write 
operation muM .vsrn. t- . ^.u on the data drive and must fiirther update the parity on the parity 
drive. Hovvcv.-r ..:'v r blocks of data are written. RAID 3 is an efficient configuration. 

10 R.\ID pr ...^ pn..c:t.onagainstthelossofadiskwithnolossofwriteorreadspeeds, 

but RAID?., on.. ... :..iart..c read and write operations. The RATO 3 tnmsa^ 
thatofasm..cJ;i . . pure implementation, requires the host to read and write in multiples 
of the number ot . . - " the RAID 3 group, starting on the boundary of the number of daU 
disks in the RAi: • r 

15 AR,MI> - ^ • ..r.:K.n stores user data by recording parity on a dedicated drive, asm 

RAID - anJ tror • . of data to single disks rather than spreading data blocks across 
multiple dnvc-. configuration has no significant advantages over RAID 3, it is rarely. 

if ever, used 

^ R.Ml> ' - stripes user data across the array and implements a scheme for 

20 storing pan,^,h..:.. ..• -r.. I O bottleneck of RAID 3. Parity data are generated for each write, 
however. pxM,> srrcad evenly, or interleaved, across all drives to prevent an I/O 

bottleneck at the px-.- v . r Ihus. the RAID 5 configuration uses parity to secure data and makes 
it possible to rc..>r. .,n.-. l-st data in the event of a drive failure, while also eliminating the 
bottleneck of stonn;- ^n. on a single drive. A RAID 5 configuration is most efficiem when 
.5 >^Titing small bUK.s o. .au. such that a block of data will fit on a single drive. However. RAID 
5 requires ubcn un:.n. . Mock of data, that the old block of data be read, the old pant>' data be 
read ncwpant^K•|:cncrau:db^removingtheolddataandaddingthenewdata. Then the new data 
and the new pant, arc v.nttcn. This requirement to read, regenerate and rewrite parity data is 
temped a read modify ^ruc sequence and significantly slows the rate at which data can be wTttten 
30 in a RAID 5 conr.,ura..on Thus this requirement creates a -write penalty." To mmimtze the 
performance impact. RAID 5 stripe depth can be set to be much larger than the expected data 
transfer size, so that one bi.K:k of data usually resides on one drive. Consequently, if new data are 
to be written. onU the C.ccted data drive and the drive storing parity data need be accessed to 
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complete the write operation Thm PATn<^ ^ 

read/mod,fy^e«,ue„ce cau.es RAID 5 .o have a ■«ri»p«^,y.. 

5 In practice, RAID configurations 1 1 anw ^ „ 

igurauons J , 3, and i are most commonly used 

differtn, ai.. „0 need. *a. „o^. „e^ O^T^T "^"^ ""^ 

mixture ofthe supported RAID levels. »^ea as a disk array to use a 

There are several problems with this apmoach Th«f5rc,« u. • 
c^o^er. The ^ JZ^^T '"""'"^""'"^ 

The main solution to the first problem has been tn i;™;, ♦u 

> program, „. hy ,h. c„s.„™e, „h„ chooses a Zr!*^"™™™ """'^ 
ccnfigwauo. ™.so,„«o„„ea„.*a.:he IL^ ' 7 "T"' '"'^ 

value fen, rte RAID product Also. U»cu.,o«,r may no. ge, full 

Thesecondproblemisthalthecus<oraereilherdoesn'lk«o»„h. i. 
I/O .,,t„ ^ . . ""^"^WMchaiactenstcsofhisdisk 
l/o,orthesechaiaaensacschanseoverliine nrlwi, • . "irasoisic 

-co„«gu.,o„„__;^.,::::-^-~— 

of d,e dls, „o ca^o. ^ n^a^hed „ .he hes, RAID ..^^ " '"^ 

use .„o1r'"'"""'°"°'"*^''°"- ^•^■-''*°'"^'«^''-ddis.suhs,,e™s 
"se iwo basic measurements to evaluate these systems H,. »■ . 

Characteristics ofthe anached disks Disks ar^' -ve,s the 

-me hasic feamres use t^ « >«-'y commodities. Tltey al, have the 

customers can rpTthTdirhv^"" T" ^ ~ 

capacity^pinrate JinteJl^r ZZ " ' 

■™»fer««e. measurements can be used to direcUy compare 
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various disk products. 

The second measurement is performance when attached to a host computer. It is often 
possible to use performance tools on the host computer that will report transaction data, such as 
response time, I/O operations per second, data transfer rate, request lengths in bytes, and request 
types, such as reads vs writes. It is also common to measure total throughput by using a 
performance tool to report throughput, or by simply running applications and measuring elapsed 
time. 

A typical customer's expectation is that a new product will not be slower than the products 
the customer has been using. The customer is happy to get additional protection against the loss 
) of a disk by using a disk array, and is even willing to pay a small premium for this protection, since 
they can measure the additional cost against the additional protection. But the customer is not 
generally willing to accept slower performance because of a "write penalty". 

Disk array products will continue to be evaluated in the same manner as normal disk 
products are evaluated. In order for disk arrays to be totally competitive in the disk products market 
15 they will have to eliminate the "write Penalty" in all of the commonly used cases. 

A fourth problem with requiring the customer to set the configuration is that RAID 
manufacturers often do not allow dynamic changes to the RAID configuration. Changing the 
number of disks being used, and changing the levels of protection provided at each target address, 
often requires that data be migrated to a backup device before the configuration change can be 
20 made. After the configuration is changed, the managed disks are re-initialized and the data is then 
copied back to the disk array from the backup device. This process can take a long time and while 
it is in progress, the disk array is off-line and the host data is not available. 

TTie current generation of disk arrays appeared in the late 1980's. This generation is divided 
into completely software versions, that are implemented directly on the host using the host's 
25 processor and hardware, and versions using separate hardware to support the RAID software. 

The hardware implementation of disk arrays takes multiple forms. The first general form 
is a PC board that can plug directly into the system bus of the host system. The second general 
form is a PC board set (one or more boards) that is built into a stand-alone subsystem along with 
a set of disks. This subsystem often supports some level of fault tolerance and hot plugability of 
30 the disks, fans, power supplies and sometimes controller boards. 

Generally, the current generation of disk array systems support RAID 5, which requires 
fairly powerftil processors for the level of processing required to support large numbers of RAID 
5 requests. The controller board(s) in a disk array, as well as the fault tolerant features, increase the 
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Pric of tte di.^ ,„h,,,,em. DUc array Luftcm,^ jea, wiU, fte hi.h., . ■ . 
sWctinghardu^^K supping .oU«i,is^ 

A«oU.o, rroMcr, ,ha, disk array manufacmrcrs have IsO^m. capaciUes of SCSI H- v 
continue lo incKaK r.r.JI> a, ,he cost of ,h, A- t o«MCines of SCSI disks 

resulted in the „ ,j , , k, """"^ ^ ™» <™d 

the „.,j ,„ ,K,c to supply disk amys that have sntal. „u„,hers of disks (3^, 

. pp. n n..L,p.. SCSI channels, typieally eight or tnoi*. and a SCSI I ch^, 
suppon SIX o' • • 1 ciiannel can 

. „; : :: :: ;r::;rof°r "'"'"'*°"""""°-*"'^^"'- 

powerful e.,u .... - I »f TOd te,^ eo«,„Uer b„aW(s) that are 

cheape„oJ:::r:;T 7^°'*«*^^«"— •^'-^.ntea.e 
t, u,,,naJ,skarraysubsystenithatoniyhas3or4disks 

isanotho . -"-"he data stored in the array. There 

»^t^nccd.n,,...,....,,.^^._„^^^^^ 

present mven,.on m.-c. ^ other needs in the art. 
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30 



Or.SCLOSURE OF THE INVENTION 
(RAID 'l'!!"'""^ ■" " ^"^^"'^""^"^-.y oflndepe^en, Disks 

<.™o„!L::t:r:.ro.r"'"'~'"'^"^- 

present rrd,'."";:"' " >» -""^ - *e array. .hi,e any data 
present on the d.sK. uhcn it is added, remains available 

<.n,thetar"*'"'"''"'""°"™'^""°"^-^~"''''-----^~ 
A SUU fnrthc, aspect . to provide a systen, that nsntUly rentoves the »Hte penalty v*i,e stUl 
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providing full RAID functionality. 

The above and other aspects of the invention are accomplished in a RAID system that is 
adaptable to host I/O reads and writes of data. The RAID variations are hidden firom the host, thus 
the system removes the need for a customer to understand the various possible variations within the 
5 RAID device. Configuration of the system requires only that the host/customer/system 
administrator provide a level of configuration that defines the target addresses (such as SCSI 
ids/LUNs) to which the disk array must respond, the capacity of the defined target addresses, and 
whether the data at each target address is to be protected against the loss of a disk. 

Tbs determination of the RAID variation used to store host data is made dynamicaUy by the 
10 disk array of the present invention. This determination is made to maximize response time 
performance and also to minimize the loss of-disk space used for providing protection against the 
loss of a disk. 

The RAID variation can be changed dynamically, on-line, while the data remains available 
to the host and can be modified by the host. These changes in variation allow the system to 
15 reconfigure itself to allow a disk to be deleted from the array, or be added to the array. In addition, 
a disk being added may have existing data, and this data also remains available to the host and 
modifiable by the host. After the disk is added, its data will be striped across all the disks of the 
array. 

The system also hides the variation changes necessary for the addition or deletion of disks 
20 to the disk array. While these changes are in progress, the disk array remains on-line and all host 
data is available for access and modification. Additionally, the blocks associated with each target 
address can have their characteristics changed while the data remains available and modifiable. 
Thus the host can dynamically add new target address entries, change the number of blocks 
allocated to the entries, and change the protection afforded to the entries. 
25 To maximize response time, small write operations are written into data blocks organized 

as a RAID 1 configuration, so there is no ^-rite penalty. These RAID 1 blocks are re-written into 
data blocks organized as a RAID 5 configuration, as a background operation, to minimize the disk 
space lost. 

To maximize response time, medium and large write operations are written into data blocks 
30 organized as a RAID 3 configuration, to prevem a write penalty, to maximize bandwidth 
performance, and to minimize space lost to providing protection. 
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^ , DESCRIPTION OF THE DRAWINGS 

*e foilowing " ""^ in conj»c.„„ 

control module of IhcpresenlinvmUon- "-^gMbya 
F*2^owsU,.Moc.di^.3n,ofn, , 

.0 -^-^ — 

Figs. 6-9 show the traiKparen, RAID data otsanization- 

Z ' uTh ^ '""^ ^"'"-•^ -P-« RAID. 

Fig. , 7 shows a flowchan ^^^^^^ 

2 ' *o„s a flowchan of the process of ™.„vi.g a disic ^ ^ 
F,g. 19 shows Ihe block layout of adapUve RAID- 

Fig. 20 shows a flowehan „f ^ ^.^^^ ^ ^ 
F,g.2 ^^owsaflowchartofthep^cessoft^nt^ag...^^^^^"* 
F.g. 22 Shows a flowchan „f the adapUve RAID write opention; and 
F.8. 23 Shows a flowchan of ^^^^^ ^ 

BEST MODE FOR CARRy^c oUT THE INVENTION 

be detenninea by referencing the appenl, Cail ""^ "^"^ invention shouM 

.na«>icalope,adngsy3.e„.suchasd«u..(™,„pe^,_,^^^^^^^^ 
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independent entities. These disks have mountable file systems defined to use part or all of a disk, 
however, typically a file system cannot span across more than one disk. Thus a Unix system with 
4 disks would have at least four mountable file systems. Normally a single application will use a 
set of files that aU reside on the same file system. 
5 Fig. 1 shov>« a computer system 100 having a host computer 102 connected to four disks. 

Ahost SCSI bus 104 connects the host computer 102 to a control module 106. Those skilled in the 
art vsriU recognize that any type of I/O bus that connects the host computer 1 02 to the controller 106 
will function with the invention. The control module 106 is connected to four disks 110, 112, 114. 

and 1 16 through a SCSI bus 108. 
10 Fig. 1 also shows that the contn>l module.106 is capable of responding to all of the SCSI 

device ids and logical unit numbers (LUNs)ofthe managed disks. The control module 106 
responds to the set ofSCSI ids andLUNs that were originally used forthe disks 110. 112, IM.and 
1 16 The SCSI id/LUN that the control module 106 responds to may not have the data that is being 
requested by the host, however, the host computer 102 will still access the same SCSI ids that were 
15 available when the managed disks 1 10. 1 12. 1 14, and 1 16 were directly connected to the host 
computer 102. The control module 106 will respond using the same SCSI ids and with the same 
edacities and characteristics that were available when the managed disks were directly comiected 
to the host. 

The original data on the disks is redistributed and evenly striped across the disks bemg 
20 managed by the control module. The effect of this striping is to cause a single application's data 
to be evenly striped across all of the managed disks. 

In an un-striped configuration, the worst case performance occurs when a single application 
accesses all of its data on a single file system, which is on a single disk. The best case occurs when 
multiple applications perfonn a large number of disk requests, resulting in accesses to aU the file 
25 systems and disks, to provide the best overall throughput 

With the striping provided by the control modules, the worst case performance also occurs 
v.hen a single application accesses all of its data on a single file system. But since the data is 
striped across all the disks being managed by the control modules, the accesses will tend to be load 
balanced across all the disks, so that the worst case operates at the same level as the best case 
30 operates in an un-striped configuration. Therefore, the best case, and the worst case performances 
for the striped data configuration are the same. 

V.'h.n the control module is managing a parity disk, the associated SCSI id/LUN used by the 
managed parity disk is not available to the host. That is, the host camiot use the parity disk SCSI 
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id/LUN u, commuaicaK vrift u„ «, of ^^g,;,''^,^ 

wherem U,c fifth .i^ 2,8 is <W1„«, as a parity,;^ ' f^""' 
id=^UN,fo,^.a^,„„^ TTieseSCSI a ^'""^ ca„ us. U,. SCSI 

5 ofAefimfc„<iisks2,o.2,2 2Ua„H,,^ "^^^ ™'' «P«=i<i« a™i charaortaics 
Hie user daiaivrinen by ihe host compmer 202 ,,«ri j 

~r m. DMA e^es 3.0 « S^T^T"""' 
^ .04 au. n4.a„.acache™e„„,r:^,r,": "T" 

"taa bioclcs. .o be managed ^ » assu„ *a, al, J " °f 

-pu^. When „u«p,e a-e n^^Z T ^ ^ 

-h recaugie has a se. ofdisics *a. a, ItlT '""""'"'■'"■'""^'^■"^ 

-JiecTzir^iiritr:"'^^^^ 

-.-..s.a.i.defi„esu,errf~0llt2::"7'°"^"-^ 
™«i-g on ttis dist, i„ excess of u,e spTul " ° 

S^..ly,.e.e„ai„i„..paeeo„dis.T2^:r:^''r'""""'°^'"'"^'='- 
^ ^pace o„ disic 3 408 defies ,he si^e of rectanglef 

Because of U,e number of dislcs in rectangles 0 and I ,b 

-«n, rectangie. ™eac„.,„«i„„„,^^'^ ~f °» be d.e s»e 

be readily detennin^i when the disk is accesll ^ " - '-8 - can 

-other goal ofthesyste™is.oa„o„dis.s.hatha.da.a .ready storedonthe^tobe 
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incoiporated into the set of managed disks, and to allow the data from a new disk to be spread 
across all the managed disks to provide significantly higher levels of performance and to allow 
protection against the loss of a disk. Still another goal is to dynamically add or remove a disk from 
the set of managed disks while maintaining the integrity and availability of the data stored in the 
5 system. 

To accomplish these goals, each rectangle is divided into a set of "squares". A square is a 
portion of the set of disks contained within a rectangle. The number of blocks in each square is 
equal to the number of disks in the rectangle multiplied by the depth being used by die rectangle. 
Each square typically starts at the same logical block number on each disk. 

10 Since the number of blocks in a rectangle is not necessarily an even multiple of the number 

of blocks in a square, there may be a "partial" at the end of the rectangle, and this partial portion 
contains the remaining blocks in each disk that cannot fit in a square. These partial blocks do not 
participate in tfie striping operation, described below, and thus remain un-striped. They will have 
data protection, however, since parity can be maintained with an un-striped configuration. 

1 5 Fig. 5 shows an example of a rectangle and some squares that fit into the rectangle. Referring 

to Fig. 5, four disks 502, 504, 506, and 508 are shown, wherein the disks comprise a rectangle 
containing 1000 blocks on each disk. In this example, the depth is four blocks, and since there are 
four disks, each square contains 16 blocks on each disk. In this example, the rectangle contains 61 
squares, and there are six blocks left over on each disk. These left over blocks comprise a partial. 

20 The squares organization is used to allow data to be striped and un-striped across disks. Since 

each square has the same number of rows and columns, wherein one row is a depth's worth of 
blocks, and there is one column per disk, mattix transposition is used on a square to stripe and un- 
stripe data blocks, as will be described below with respect to Figs. 11-16. 

The management of data on the disks of the array is layered. At the first level is the 

25 management of striping of data blocks and possibly parity blocks. The first level of management 
is also responsible for sparing and reconsuiiction operations. This level is called transparent RAID. 
The second level of management is adaptive RAID, as will be described below. 

In transparent RAID, the only configuration information the host/user/system administrator 
can specify is that an added disk is to be used as a data disk, a parit>' disk or a spare disk. The disk 

30 array uses a disk as a data disk if the type of disk is not defined. The host/user/system administrator 
can also specify the depth, that is the number of blocks written on a specific disk before writing 
moves to the next disk. 

In transparent RAID, the data blocks on each disk and the parity blocks, if a parity disk is 
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being used, are automaticaljy striped across all nf th« 

all disk, i„cl«„ng U>c ,„w ^ ^!tr " 

striping operation are done as backor . operations for the re. 

If the disk array is shut down durin« the re ««« 

corxectly.and whenthediskarrayisrebootL ^ 1 ^" ""^ " '''^^^^^^ 

15 where it stopped. ^''"^'"^ "P^'^'^"" ^" from the point 

During the process of re-stripine it mav K» 
.-midon variai.^. "-^ " «~-s«y » go u^„gk 

Transparent Variattons 

Transparent RAID supports inmnber of vahatioM Th,„. - • 
20 and r^^^ ^ ^ ~ ™= -n«,o,« ate used ,« allow adding 

PtcscvingallexisUngdataonU^setrftl / '° 

deleted. ^""™^«'**^^'«"=='0"ft= disks being added and/or 

The variations supported are: 
transparent non-striped, 
25 transparent striped, 

protected transparent non-striped, and 
protected transparent striped 

~::ire:::::r™r " ^^^-^ - - - 

30 Transparent Non-striped 

This transparent RAID variation is a direct pass through of H , 
and the« is „o protection agair^t the loss of. H" T " 

• . °^*^'^'''^"'ce no parity disk is ^nnn^rt-H r 

this vanation treats the disks as totally independent. " 
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Fig. 6 shows four disks 602, 604, 606, and 608 being managed as transparent non-striped, 
and each disk has its own SCSI id and LUN. Host SCSI requests are passed directly to the managed 
disks without any id mapping. 

Transparent non-strif)ed is one of the base transparent variations used to allow the addition 
and/or removal of disks. Since there is no striping, the data blocks for each of the data disks are 
completely contained on the corresponding data disk. 

In this variation, when a disk is added to the managed set of disks, it is made immediately 
available to the host as soon as the disk array completes power up of the disk. In addition, any data 
that was stored on the added disk is also available to the host. 

In this variation the host/user/system administrator can also remove any of the disks from 
the managed set at any time. Once the disk array is notified to remove a specified disk, the disk 
array will not respond to any host references to the associated SCSI id and LUN of the removed 
disk. 

Transparent Striped 

In this transparent RAID variation, there is no parity data, but the data is striped across all 
of the managed disks using a depth defined by the host/user/system administrator, or the default 
depth if none was defined by the host/user/system administrator. To the host, there will still appear 
to be the same number of SCSI ids that were present when the disks were directly attached, and 
each of these disks will have the same number of blocks that were available when the disks were 
directly attached. This supports load balancing of improtected data. 

Fig. 7 shows four disks 702, 704, 706, and 708 in a managed set. The array still responds 
to SCSI ids 0-3 when the host selects these SCSI ids, but the data is striped across all four of the 
disks. For example the curved line in each of the disks 702, 704, 706, and 708, represents that the 
data that was originally stored on the disks is now striped across all the disks. 

The rectangles organization, discussed above, is used for all managed disks in all transparent 
RAID variations, except for transparent non-striped. The rectangles organization is one which will 
allow all data blocks to be available even when the disks being managed have varying sizes. 

The Squares organization, discussed above, is also used for all managed disks for all the 
variations except transparent non-striped. The Squares organization fits within the rectangles 
organization, and allows the data in the managed set of disks to be transposed from a non-striped 
layout to a striped layout, and vice versa, while remaining on-line, and without requiring any disk 
space to be removed from use by the host/user. 

The main feature of the transparent striped variation is that accesses by the host to a single 
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SCSI id and LUN are distributed across all of the managed disks, thus giving possibly higher levels 
of throughput and/or response times to the host without making any changes to the host disk driver 
software. 

The main drawback of this variation is that it is not protected and the data for all the 
5 managed SCSI ids and LUNs are striped across all disks. Thus a single lost disk will probably 
effect the users of all SCSI ids. instead of just the users who were specifically placed on the lost 
disk. Additionally, when die lost disk is replaced it will probably be necessary for the data for all 
of the SCSI ids, that is all disks, to be restored since all SCSI ids will be missing the data that was 
on the lost disk. 

10 Protected transparent non-striped 

This transparent RAID variation is used to protect a set of managed disks, that^o not have 
the data blocks presently striped, by using a parity disk. This variation is similar to transparent non- 
striped except that the user blocks are protected against the loss of a disk. This variation appears 
to the host computer to be the same as the transparent non-striped configuration when the 
host/user/system administrator wants to add and/or remove one or more disks from the managed 
set of disks. 

Fig. 8 shows four data disks 802, 804, 806, and 808 that are accessible by the host using die 
associated SCSI id and LUN supported by the disks. The user data is not striped. The fifth disk 
810 is a parity disk and contains parity data built from the other four disks. Tlie parity data is 
completely contained on die parity disk. TTiis parity data is simply the exclusive OR of all the data 
on disks 802, 804, 806, and 808. done on a bjte by byte basis. For example, the first byte of a block 
of data on disk 802 is exclusive ORed with the first byte of data of a corresponding block on disks 
804, 806, and 808, and die exclusive OR result is placed in die first byte of a corresponding block 
on disk 8 1 0. All odier bytes of all other blocks are done die same way. such diat all the data on disk 
810 is the exclusive OR of all the data on die other disks. This parity data can be used to 
reconsmict the data on any one of die data disks 802, 804, 806. or 808, in die evem diat a data disk 
fails. The mediod of reconstructing this data is well known to tiiose skilled in the an. 
Protected Transparent Striped 

This transparem RAID variation is die normal transparent RAID variation that is used by 
adaptive RAID (described below). This mode has completely striped data as well as completely 
striped parity data across all the disks in the managed set of disks. 

Fig. 9 shows four data disks 902. 904, 906. and 908 that are accessible by die host using die 
associated SCSI id and LUN supported by die disk. The user data is striped. The fifth disk 910 is 
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defined as the pani> di %k but it contains striped user data as well as striped parity data. 

This is the normal R-\ID 5 configuration using a set depth that will support the loss of one 
disk without losing' ;:n> host data. 
Sparing 

5 One or m;»ri- sr.!-.-. c.ir specified to be added to support the data in the configuration 

against loss of ou. i>f rr disks of user data. When a data disk fails, one of the available spare 
disks, if there i> 4 nc j uiijric. is automatically chosen and added into the configuration. The 
blocks in the ip-rc J: k built using data re-generated from the remaining disks in the 
configuration Wlui^ tr. rrp J^cmcni process is in progress, the configuration has three parts. The 

1 0 fnst pan conuir.N i:k irc Ji k wuh rebuilt data that has replaced the failed disk. The second part 
contains the I Kkk x c ^ urrenily being used to rebuild the data for the spare disk, and this part 
is locked out o:.»< • u < r \* mlc it is being rebuilt. The third part contains the configuration that 
contains an c»f:l:r.. J: i tr^ u:\cd disk, and requires references to data on the off-line disk to be 
dynamicallx t:cncrj:;j u^;-:/ the other disks. 

15 If a xarijtH-:: trirv f^.%:!ion to add or delete disks is in progress when a disk fails, the: 

transposition oTx-rjti. '. v^* i i.»mp:cie the active square being transposed, so the lock around that 
square con Kr rem* \ c : T >.<rr, vic transposition is suspended until the sparing operation completes. 
Once the spanni- itvj- t vi»mplete, the transposition operation will continue to completion. 

When a ^r. * ct. r . ..n^- disk is replaced by an operable disk, the new disk will be treated ^ 

20 as the spare and be rr.Ajc j\a laSle for sparing operations. 
Depths 

The pri»|x: Jath^ :xr used arc 'impendent upon the characteristics of the data. Shallow 
depths cause the rcjJ ^r.J unte operations to. cross boundaries, thus involving multiple disks in a 
single iransacuon This ^rv^^mv causes overall throughput in the system to be impacted, since the 

25 system will be able tp pr;Kc%> lewcr concurrent requests. A deep depth will reduce the number of 
boundar>- crossing::. K-t it hj> wvcral disadvantages. The first disadvantage is that a deep depth will 
cause reads or \^Tlle^ \Min hi^-h hKality to bottleneck on a single disk. The second disadvantage is 
that a deep depin lenJx it> eliminate the possibility of doing RAID 3 writes or RAID 3 broken 
reads as cffeciix cl> a> pi>iMblc. 

30 One wa> i;» determine the approoriate depth is to keep a set of heuristics to detect 

characteristics that can be used to choose a more appropriate depth. The type of heuristic data 
needed might be: 

1 ) length of requests - if a particular length was predominant, pick a depth that 
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correq)onds well to the request length. 
2) boundaries of requests - if the requests are of a particular length, and they fall on 
particular boundaries, such as multiples of some number, that number can be used 
for the depth. 

5 3) break statistics into a small number of buckets to allow for more than one set of 

length and boundaries. 
Also, in order to support the squares fomiat, depth must be limited to a reasonable size that 
will allow the transposition of a square in a short period of time, typically milliseconds or less. 
Blocks in a square cannot be locked out from a host for a long period of time, such as seconds, or 
10 performance may be tmacceptable. 

To operate efficientiy and effectively in accessing, updating, and protecting *e host data, 
the system nonnally operates in either the transparent striped or protected transparent striped 
variations. However, before addmg or deleting disks, the system must be operating in either the 
transparent non-striped or protected transparent non-striped variation. Therefore, the system must 
15 transit between the different variations. 

Fig. 10 shows which transitions between transparem RAID variations can be performed. 
Referring now to Fig. 10, the transparent non-striped variation 1 002 exists when the disks are first 
placed under management of the array. From this variation, a new disk can be added or removed, 
as shown by circle 1006. Also from the transparent non-striped variation 1002, the data can be 
20 striped over the disks being managed to move to the transparent striped variation 1004. 

From the transparent non-striped variation 1002, a parity disk can be added, and parity 
accumulated, to cause a transition to the protected non-striped ^'ariation 1 008. From die protected 
non-striped variation 1008, the data can be striped across the disks for transition to-the protected 
transparent striped variation 1010. 

As Fig. 10 shows, the system cannot move directly betweeri the transparem striped variation 
1004 and the protected transparem striped variation 1010. Ifthis type oftransition is required, the 
system must move through variations 1 002 and 1 008 to complete the transition. 

Fig. 1 1 shows a flowchart of the process of transposing the data to accomplish the 
transitions as described in Fig. 10. Fig. 1 1 is called whenever the system needs to change the 
transparem RAID variation. Referring now to Fig. 1 1, after entry, block 1 102 determines if the 
requested transition is between the transparem non-striped variation and the transparem striped 
variation. If so, block 1 102 transfers to block 1 104 which calls the process of Fig. 13 to stripe the 
data on the disks. After striping all the data, control returns to the caller of Fig. 1 1 . As described 
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above, data in the last, partial portion of the disk will not be striped. 

Block 11 06 determines if the transposition is between the transparent striped variation and 
the transparent non-striped variation. If so, block 1 106 transfers to block 1108 which calls the 
process of Fig. 13 to un-stripe the data on the disks. After un-striping all the data, control returns 
5 to the caller of Fig. 1 1 . 

Block 1110 determines whether the transposition is between transparent non-striped to 
protected transparent non-striped. If so, block 1110 goes to block 1112 which exclusive ORs the 
data within blocks, as described above, to create parity data and store this data on the parity disk. 
Block 1112 then returns to the caller. 
10 Block 1114 determines whether the transposition is between protected transparent non- 

striped and protected transparent striped. If so, control goes to block 1116 which calls the process 
of Fig. 13 to stripe the data across the data disks. Block 1 1 18 then calls the process of Fig. 12 to 
distribute parity over all the disks. Control then returns to the caller. 

If the transposition is from protected transparent striped to protected transparent non-striped, 
15 block 1114 goes to block 1 120 which calls the process of Fig. 12, once for each square, to combine 
the parity data onto the parity disk. Block 1 122 then calls the process of Fig. 12, once for each 
square, to unstripe the data. Control then returns to the caller. 

Fig. 12 shows a flowchart of the process for distributing or combining parity data over the 
managed disks. Referring now to Fig. 12, after entry, block 1202 selects the first or next rectangle. 
20 Block 1204 then selects the first or next square within the selected rectangle. Block 1206 positions 
a block position pointer to the first block in the square. All operations of blocks 1212 through 1 222 
are done relative to the block position pointer. 

Block 1212 selects the first, or next, depth group within the square. A depth group is the 
number of blocks in the depth, over the set of managed disks. 
25 Block 1214 then reads the number of blocks equal to the depth from the disk having the 

same number as the depth group. For example, if the depth were two, and if the second depth group 
is being processed, block 1214 would read two blocks from the second disk. 

Block 1216 then reads the nxmiber of blocks equal to the depth from the parity disk. Block 
1218 then writes the parity disk data to the data disk, and block 1220 writes the data disk data to 
30 the parit>' disk. Block 1222 determines if there are more depth groups in the square, and if so, block 
1222 returns to block 1212 to process the next depth group. 

After all depth groups in the square are processed, block 1222 goes to block 1208 which 
determines whether there are more squares in the rectangle to process. If there are more squares 
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to process, block 1 20.v ^'ocs to block 1204 to process the next square. 

After all squarcN in the rectangle are processed, block 1208 goes to block 1210 which 
determines whcilicr a!! rccunples within the managed disks have been processed. If there are more 
rectangles to procc% r i w k 1210 goes to block 1202 to process the next rectangle. 
5 After all recur. - . nj\ c been processed, block 1210 returns to its caller 

Figs. 1 4 a-iJ 1 Nf.ovv an example of the process of combining parity. The process of Fig. 
12 is also follimcJ :.>r j: tnbuling parit}'. 

Figs. 1 3. A onJ ! * !i :.ht»\\ a flowchart of the process of striping or un-striping data within the 
system. Refemni: u t \ i/ ! > A. after entry, block 1302 selects the first or next rectangle. Block 
10 1304 then dcicnrunc* i: ^'.i rectangles have been processed, and returns if they have^ - 

If any rccian. ir r jn;am. block 1304 goes to block 1306 which selects thdVfirst or next 
square within the rcwui: . r .cicjicd in block 1302. Block 1306 determines if all squares within this 
rectangle have hccr p? r .-cJ. and if they have, block 1308 goes to block 1302 to get the next 
square in the sciccicJ Tr.ur.^-lc 
15 If all squaxcN r.j. v f>. i been processed, block 1301 sets a block position to the begiiming of 

the square Tlu- bi. k k r* ^iTion is used in all square processing as the origin of the block, so that all 
other block select:* »r u :--.:r. ihe block are relative to the block-. 

Block 13:: w: . t c J.-p-ii group mmiber to zero, and block 1314 selects the first or next data 
disk sioninp nmiH dau .» .-eri> Block 1316 skips past a number of blocks to position at the block 
20 equal to ihe depth i mr ti. .i.itn disk number +1. This block is the first block to be exchanged. 

Block rOKva:i.»i. 1 3B to exchange data at this location, and then block 1320 determines 
if all the blocks on tr.r disk, for the entire square, have been processed. If not;^ block 1320 
returns to block P' U u- continue processing this data disk within the square. 

After the entire aI. ihc blocks on this data disk within the square have been processed, block 
25 1320 goes to block 1}Z2 uhich increments the data disk number, and also sets the depth group 
number back to zcn* liit^k 1 324 then determines if all data disks within the square have been 
processed, and if not. returns lo block 13 16 to process the next data disk within the square. 

After all data disks m the square have been processed, block 1324 returns to block 1306 to 
process the next square 

30 Fig. 1 3B shows ihc process of exchanging data within a square. Referring to Fig. 1 3B, after 

entry, block 1 350 rcad^ a depth's worth of blocks (i.e. a number of blocks equal to the depth), at 
the location defined iniiiall> by block 1316. Then block 1316 skips past the number of blocks it 
reads to leave the pointer a: the next block after those already read, in preparation for the next pass 
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through this block. 

Block 1352 then reads a depth^s worth of blocks from the data disk that has a number equal 
to the data disk selected in block 1314 plus one plus the depth group number. On this disk, the 
blocks are read from the location computed by multiplying the disk number (from block 1314) by 
5 the depth. 

Block 1354 then exchanges these two depth's worth of blocks, and block 1356 increments 
the depth group number before returning to Fig* 13 A. 

Figs. 14, 15, and 1 6 show the data organization of a square of data, and illustrates how this 
data moves during the Uansposition between some of the variations. In the example of Figs. 14, 
10 15, and 16, the depth is equal to two, there are four data disks, and one parity disk. Also, in this 
example, the data blocks are numbered, while the parity blocks for each depth group are represented 
by the letter "P". 

Fig. 14 shows an example of how data is stored in a square within the protected transparent 
striped variation. Specifically, Fig. 14 illustrates striped data and distributed parity. 

15 Applying the flowchart of Fig. 12 to the data organization of Fig. 14 results in the data 

organization shown in Fig. 1 5, which shows striped data and combined parity. In this example, the 
depth's worth of blocks outlined by the dotted lines 1402 and 1404 are exchanged using the process 
of Fig. 12. Similarly, the other parity data is exchanged with the non-parity data resulting in the 
example of Fig. 15, which shows combined parity data within the square. 

20 Applying the flowchart of Figs. 13A and 13B to the data organization of Fig. 15 results in 

the data organization shown in Fig. 16, which shows un-striped data and combined parity. For 
example, the depth's worth of blocks outlined by the dotted lines 1502 and 1504 are exchanged, as 
are the depth's worth of blocks outlined by the dotted lines 1506 and 1508. Similarly, blocks 
outlined by 1510 and 1512 are exchanged, blocks outlined by 1514 and 1516 are exchanged, the 

25 blocks outlined by 1518 and 1520 are exchanged, and the blocks outlined by 1522 and 1524 are 
exchanged to provide the data organization of Fig. 16, which is non-striped and combined parity. 
Add A Disk 

When the host/user/system administrator requests that the disk array add one or more disks 
to the set of managed disks, the system must change the managed disks to a particular transparent 
30 variaiion, as discussed above with respect to Fig. 10. A request to add one or more disks by the 
hosi/user/system administrator will be delayed any lime there is already a transition operation in 
progress, or any time there is a sparing operation in progress. 

If a disk is to be added while in the protected transparent striped variation, the new disk is 
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first added to the set of managed disks as a transparent non-striped disk. This makes it immediately 
accessible to the host, unless it is to be a added as a parity disk. If the disk already contains user 
data, this data is also immediately available to the host, and the data will be striped along with the 
other data on the other disks. 

Fig. 1 7 shows a flowchart of the add disk process. Referring to Fig. 17, after entry, block 
1702 makes the new disk available to the host computer, as a transparent non-striped disk, if the 
disk is to be a data disk. Block 1704 then unstripes the existing disks, by calling Fig. 11, to 
transpose the parity blocks and then transpose the user data blocks for each square on the existing 
disks. Block 1706 then includes the new disk in the configuration, and block 1708 calls Fig. 1 1 to 
transpose the data and parity on the disks, including the new disk, in order to re-stripe the disks. 

As the transition proceeds, the variation will be altered to reflect the changes to the data 
layout on the managed disks. That is, once a square has been n-ansposed, its variation is changed 
to reflect its new organization, either un-striped or striped, protected or non-protected, depending 
upon the particular transposition in progress. Thus, during the u-ansition, the system manages the 
disks as partially striped, partially un-striped, protected or not protected, as the transposition is 
completed. This allows the data to be available during the transposition, and only the data in a 
square currently being transposed is not available, and this data is only not available during the 
short time that the transposition of the square is in progress. 

If a shutdown is requested during the transition, the transposition of the active square will 
complete before the shutdown will be honored. 

If the new disk being added is a parity disk, it is not made available to the host, since parity 
disks are not ordinarily available to the host computer. The system will unstrip the existing disks, 
and strip the new set of disks and regenerate parity, to include the parity disk. ^ 

If the existing disks did not have parity, that is, they were a transparent striped variation, the 
process proceeds as in Fig. 17, except that there is no parity to transpose. 
Remove A Disk 

When the host/user/system administrator requests that the disk array remove one or more 
disks from the set of managed disks, the system must change the managed disks to a panicular 
transparent variation, as discussed above with respect to Fig. 1 0. A request to remove one or more 
disks by the host/user/system administrator will be delayed any time there is already a transition 
operation or sparing operation in progress. 

Fig. 18 shows a flowchart of the remove disk process. Referring to Fig, 18, after enuy, 
block 1802 unsuipes the existing disks, by calling Fig. 1 1, to transpose the parity blocks and then 
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to transpose the user data blocks for each square oil the existing disks. Block 1 804 then removes 
the disk from the set of managed disks. Block 1806 then calls Fig. 1 1 to transpose the data and 
parity on the remaining disks in order to re-stripe the disks. 

As the transition proceeds, the variation will be altered to reflect the changes to the data 
layout on the managed disks. That is, once a square has been transposed, its variation is changed 
to reflect its new organization, either un-striped or striped, depending upon the particular 
transposition in progress. Thus, during the transition, the system manages the disks as partially 
striped and partially un-striped, as the transposition is completed. This allows the data to be 
available during the transposition, and only the data in a square being transposed is not available, 
and this data is only not available during the short time that the transposition of the square is in 
progress. 

If a shutdown is requested during the transition, the transposition of the active square will 
complete before the shutdown will be honored. 

If the new disk being removed is a parity disk, the system will un-stripe the existing disks, 
and stripe the remaining disks without parity. 

If the existing disks did not have parity, that is, they were a transparent striped variation, the 
process proceeds as in Fig. 1 8, except that there is no parity to transpose. 
Adaptive RAID 

The second level of management is called adaptive RAID. Adaptive RAID is built on top 
of transparent RAID, speciflcally the protected transparent striped variation. 

Adaptive RAID requires configuration information from the host/user/system adminisuator. 
Using adaptive RAID, the set of managed disks will appear to the host/user/system administrator 
as a collection of blocks. The host^user/system administrator defines a set of SCSI ids that have a 
specified nimiber of blocks associated with each id. The host/user/system administrator no longer 
has a view into the way the blocks on the managed disks are organized or managed. 

Adaptive RAID does not deal with adding and removing disks from the set of managed 
disks. Instead, when a host/user/system administrator requests that a disk be added or removed 
fh>m the set of managed disks in the disk array, adaptive RAID is turned off, and the system reverts 
to the protected transparent striped variation of transparent RAID. Once the transition is made to 
the protected transparent striped variation, disks can be added and/or removed as defined above. 

When using adaptive RAID, a data disk can only be removed if there is enough disk space 
available, minus the space of the disk being removed. If there is not enough space, the operation 
will be rejected. Also, a parity disk cannot be removed while adaptive RAID is in use. 



wo 97/07461 



PCTAJS96/13238 



22 

In adaptive RAID, each disk is treated as a set of linked groups of blocks. Initially, there 
is a single group of blocks comprising all the blocks in the disks. This group is called the block 
pool. The allocation of a block group, defmed below, is taken from the block pool. 

Figure 19 shows an example of the allocation of blocks. Referring to Fig. 19, three block 
groups 1902, 1904, and 1906 are shown as linked lists of blocks. A linked list 1908 contains the 
remaining available blocks, called the block pool. When a read or write request is received, 
adaptive RAID mapping data structures are used to map the blocks requested by the host into the 
blocks managed by the transparent RAID. Since all transitions are managed at the transparent 
RAID level, the adaptive RAID mapping interface to the host interface works regardless of whether 
the adaptive RAID features are turned on or off. ^ 

The structures that support adaptive RAID are always built on a protected transparent striped 
variation. This variation is the middle ground between the adaptive RAID structures and the 
variations that allow for disks to be added and removed from the set of managed disks. Any time 
a disk needs to be added or removed from the set of managed disks, adaptive RAID is turned off 
and the portions that have been used to expand the configuration beyond transparent RAID will be 
collapsed back into a normal protected transparent striped variation. While this change is in 
progress the set of managed disks will remain on-line and accessible by the host. The only effect 
of turning off the adaptive RAID features is that performance may be impacted because the array 
will only be supporting normal RAID operations. 

Once the additional adaptive RAID pordons in the configuration have been collapsed back 
to a normal protected transparent variation, the striping will be removed by transposing into a 
protected transparent non*striped variation. After this transposition is complete, the disks are added 
and/or removed. After all outstanding additions and deletions of disks are completed^ the process 
is reversed, and the disk array will again support the adaptive RAID features. 

Transparent RAID allows the management of a disk array to provide load balancing ( RAID 
0) and/or protected data ( RAID 5), Providing adaptive RAID requires configuration informauon 
from the host, at a simple level. The host must specify a set of one or more block groups. A user 
specified block group comprises: 

an id to be used by the host for communicating with the disk array. For SCSI 
interfaces this is a SCSI id and a LUN. 

the number of blocks to be assigned/allocated to each block group. These blocks 
are logically numbered from 0 to n-1 where n is the total number of blocks 
allocated. 
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an indication of whether or not the blocks are to be protected. 

an indication of whether or not to initialize the user data blocks to a value of binary 

zero. 

These block groups can be added, deleted or modified at anytime by the host while the disk array 
5 is on-line. All existing block groups continue to be on-line and accessible during block group 
changes. 

When a new disk is added to the disk array, the blocks on the added disk are added to the 
block pool list 1908 within the disk array. As the host defines and adds a new block group, the 
space for the new block group is taken from the available blocks and reserved for the new block 

10 group. The total space specified by the defined block groups includes the parity space needed to 
provide RAID 5 operations for all protected block groups. The blocks left over from the allocated 
block groups are used as a block pool to manage adaptive RAID features. Any time the block pool 
is exhausted, for example because of a high number of host requests, the disk array will revert to 
transparent RAID operations, so the host must leave an adequate amount of unallocated space for 

15 the block pool. The amount of space necessary depends upon the access rate. 

Fig. 20 shows a flowchart of the process of creating a new block group. Referring to Fig. 
20. after entr>', block 2002 receives an id from the host to use for the new block group. Block 2004 
receives the number of blocks to allocate to the new block group from the host. Block 2006 
removes the number of blocks defined in block 2004 from the block pool and block 2008 connects 

20 these blocks to the new block group. Block 2010 then assigns the id received in block 2002 to the 
new block group, and if initialization has been requested, block 2012 initializes them to binary 
zero. The host must perform any other desired initialization. 

Fig. 21 shows a flowchart of the process of removing a block group. Referring to Fig. 21 , 
when an existing block group is released by the host, block 2102 removes the blocks from the block 

25 group, and block 2104 places all the block space removed from the block group into to the block 
pool. Block 2106 disables the block group id so that the disk array will stop responding to the 
block group id. 

The host specified features of an existing block group can also be changed dynamically. 
If the size of the block group is increased, the additional blocks are allocated from the block pool 
30 and added to the end of the block group's list. The additional blocks will be initialized to zeros, if 
requested, and the additional blocks will have valid parity if the block group is protected. If the size 
of the block group is decreased, the specified number of blocks are removed from the end of the 
block group, and added to the block pool. 
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The protected state of the block group can be changed, from protected to unprotected or vice 
versa, in the same manner as transparent RAID. Although this can be a long running operation, 
depending on the size of the block group, the block group is accessible to other requests while the 
protected state change is in progress. 
5 Operation of adaptive RAID 

The block pool entries are used in to two major ways: 

1) When a small write operation is made, a block pool of some minimum size is 
allocated and given a squares portion that is luiked into the appropriate location in 
the squares portions lists. This block pool entry will be defined using a RAID 1 

1® configuration. This block pool entry will likely be wider than 2 disksP^This squares 

portion is treated specially to allow multiple groups of RAID 1 entrie^to be created 
and used. 

2) When a larger write operation is made, a block pool entry is allocated and used to 
provide RAID 3 write operations. The parity data for this block pool entry is not 

15 striped, instead, it is always written to the parity disk. 

As data areas in normally striped squares portions arc replaced by block pool entries, the 
entire square may be replaced and added to the block pool using a new block pool entry. 
The usage of the block pool depends on the write operation being perforaied: 

1) Small random writes (less than one depth's worth) - 

20 These writes are mapped into RAID 1 block pools. This allows the write to be done 

without a write penalty. These block pool allocations are ultunately written to their 
original blocks using a RAID 5 write, during background processing? 

2) Small sequential writes (less than one depth's worth)- 

These writes are mapped mto RAID 1 block pools. The block pool allocations are 
2^ done with extra blocks allocated so that new sequential writes will not immediately 

require an additional block pool allocation. 

3) Mediiun writes (random or sequential is not important) - 

A medium write is one that is large enough to span the disks being managed with 
a shallow depth. The blocks used are allocated from the block pool and the write 
operation is performed as a RAID 3 write. Since this is an allocated set of blocks 
that can start at any logical block, there is never an initial partial square and the 
ending partial square can have old data, since parity is generated before writing the 
set of blocks. The trailing partial will be wasted space, since there is no way to 
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write it later without a write penalty. 
4) Large writes (random or sequential is not important) - 

A large write is one that is large enough to span all the disks being managed at the 
depth used in the normal square. This type of write can be done without using the 
5 block pool since it can write to the regular square blocks as a RAID 3 write. This 

type of write can have a partial RAID 3 span in the front and the end. The front 
partial span is handled as a normal small or medium random write. The trailing 
partial RAID 3 span is also treated as a small or medium random write. 
Fig. 22 shows a flowchart of the ad^tive RAID write operation. Referring to Fig. 22, when 
1 0 a write command is received, block 2202 determines if the size of the data being written is less than 
the size of the depth. That is, will the. write be contained on a smgle disk. If so, block 2202 
transfers to block 2204 which determines whether this write sequentially follows the last write. If 
the write is not sequential, block 2204 goes to block 2206 v^ich allocates new space for the data 
from the block pool. The amount of space allocated is two times the size of the data being written, 
1 5 since the write will be perfomied as a RAID 1 write, which mirrors the data. After defming the size 
to be allocated, block 2206 goes to block 1121 which allocates the space from the block pool, block 
2214 then assigns this space to a RAID 1 configuration, and block 2216 writes the data. 

If the write sequentially followed the last write, block 2204 goes to block 2208 which 
determines whether space remains in the space allocated to the last write to contain this write. If 
20 so, block 2208 goes to block 2216 to write the data in the previously allocated space from the block 
pool. 

If no space is available, block 2208 goes to block 2210 which defines the space as two times 
the data size, plus extra space to accommodate additional sequential writes. The amount of extra 
space allocated varies with the number of sequential writes that have been perfomied recently. 
25 After defining the space, block 221 0 goes to block 2212 to allocate the space, then block 

2214 assigns RAID 1 configuration to the space, and block 2216 stores the data. 

If the data size is larger than the depth, block 2202 goes to block 221 8 which determines 
whether the data size will span all the disks, that is, is the size large enough for a RAID 3 write. 
If the data will span all disks, block 2218 goes to block 2226, which writes the data directly to a 
30 square, since the write can be performed as a RAID 3 write, with no write penalty. 

If the data si2:e is larger than one disk, but smaller than the span of all disks, block 22 1 8 goes 
to block 2220 which allocates data space for the write from the block pool. This data space is the 
size of the data being written, plus parity. Block 2222 then assigns this space as RAID 3 
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configuration, and block 2224 writes the data to the space. 
Aging/Recollection Considerations 

When a block pool entry is allocated, it uses up a limited resource (i.e. the blocks in the 
block pool). At some point it may be necessaiy to move the data being stored in these blocks back 
5 to their original blocks. 

There are a number of considerations for this decision: 

1) When block pool allocations are made for a RAID 1 operation, unused blocks are 
left in the original portion of the data square, which is inefBcient. The allocated 
block pool space is also inefficient, since half of the disk blocks are used for parit}% 
10 whereas storing the data back into the square, in a RAID S layout, ules less than 

half the blocks used for parity. If the RAID 1 blocks are updated frequently by the 
host, however, it is advantageous to leave the blocks allocated in the block pool, to 
avoid the overhead of constantly cleaning up and then reallocating the block pool 
entries. 

IS 2) When block pool allocations are made for a RAID 3 write, unused blocks are left 

in the original portion of the data square, which is inefficient. The allocated block 
pool space is efficient, however, since it is stored in a RAID 3 configuration. If 
entire rows are replaced, the blocks in the original portion can be given to the block 
pool. 

20 3) Block ix)ol allocations in RAID 1 configuration are always returned to their original 

block locations, to free up the block pool area for other uses. 

4) Depth considerations determine when and if to move RAID 3 ^bIock pool 
allocations back to their original locations. When a write occurs, space may be 
allocated at a depth less than the depth of the data in the squares, to allow a smaller 

25 write to become a RAID 3 write. In this case, the data will be moved back to the 

squares where it is stored more efficiently. 

5) The more block pool allocations there are, the larger the configuration data 
structures, used to manage the block pool, become. This growth can result in longer 
search times and ultimately in running out of space for the configuration data 

30 structures. Therefore, the system constantly works in the background to collapse the 

configuration data structures back to their original rectangles configuration. The 
main reason to not continually col. apse the configuration is because "hot spots", 
wherein the host updates an area of data frequently, should be left in a RAID 1 
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conri(!ura!ion 

6) When bliK ks are allocated for a RAID 1 allocation of a small write, extra blocks are 
ul:i»4:atc J 7 hose extra blocks are used to allow sequential small writes to use the 
cxtru hl.K I. . v\iAout additional non-consecutive allocations. These extra blocks are 

5 monav^ - '^- h thai if the block pool is exhausted the extra blocks that are not being 

use J car. rx rc^movcd and returned to the available block pool to be used for other 

7) p. H : -.ra^c has a different, more shallow, depth for RAID 3 allocations to 
ensure l -ss space is wasted. In this case, the system may end up with more 

10 <>pcrj:.. r uncrc subsequent read operations cross depth boundaries and cause a 

Fig r? sat. •hart of the background processing described above. Referring to Fig. 
23, after entn . r^.* v k : : jrtcrmines whether any block pool allocations have been made. If not, 
or after priKcssin. a! i ! i.-x-rr: Hock 2302 returns. If unprocessed block pool allocations remain, 

15 block 2302 ^'ivs i.. r i. ^ i ; -iux which detemiines whether any RAID 1 configuration allocations 
are present l! ^• i :"(a transfers to block 2306 which selects the first or next RAID 1 
allocation liU^k : »v Jctc-r-nmcs whether all RAID 1 allocations have been processed, and if not, 
goes to : ' ! • • ^ t... r Jctcrmines whether the RAID 1 allocation selected in block 2306 has 

been reccr.tl> urOjtc J : • a ri.K:k pool allocation has been recently updated, it will not be moved 

20 back to the suuarr . vfv c t« more efficient to keep it as a RAID 1 allocation, rather that frequently 
re-allocaiini: ncv^ k :s , \pacc. Although how often updates must occur to prevent rewriting 
back into the ^^•aa;c^ s^*^" «^ Ucpcndent upon the type of activity from the host, one example might 
be to re-uTitc atir- rv,» updates have occurred within the last second. Therefore, if the block pool 
allocation has Krcn rcwcr.!l\ updated, block 2310 goes back to block 2306 to select the next block 

25 pool allocation 

If the aliiKatii>n hx^ not been recently updated, block 23 10 goes to block 2312 which writes 
the data from the bliKk p^hW allocation back into the location in the square, and block 2314 frees 
the space Irom the bUKk p^k>I allocation and returns it to the block pool. Block 23 14 then returns 
to block 2306 to panrcvs the next RAID 1 block pool allocation. 
30 After all R AID 1 h!tx:k pool allocations have been processed, or if there are no RAID 1 

block pool allocauons, control goes to block 2316 to process RAID 3 allocations. Block 2316 
determines if there arc K.\1D 3 allocations to process, and if so, goes to block 2318 which selects 
the first or next R-\!D 3 alkKaiion. Block 2320 then detemiines if this allocation has an inefficient 
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depth, as discussed above. If so, block 2320 goes to block 2322 which writes the data back to the 
original squares, and then block 2324 frees the block pool allocation space and returns it to the 
block pool. Block 2324 then returns to block 23 1 6 to process the next RAID 3 allocation. 

If the depth is efficient block 2320 goes to block 2326 >^ich frees the space in the original 
5 square to the block pool, and connects the block pool allocation space, containing the RAID 3 data, 
into the location of the original square. Thus the data is connected into the original square without 
being moved. Block 2326 then returns to block 2316 to process the next RAID 3 allocation. 

After cdl RAID 3 allocations have been processed, block 2316 returns to block 2302. 
Request Processing 

10 Adaptive RAID can easily end up with a substantial number of squares portions. These 

squares portions are independent and may contain data in a variety of RAID configurations. This 
complexity leads to several requirements and/or implementations: 

1 ) The searching of the configuration can be linear when the configuration is small. 
But when the configuration gets large it can require substantial time to do linear 

15 searching. Thus it is necessary to provide additional support using hardware and/or 

software to limit the time spent searching the configuration data; 

2) Because of the dynamic nature of the configuration, all read and write operations 
must lock sector ranges to assure that concurrent requests cannot cause changes to 
the same location. 

20 3) Access to the configuration structures must be tightly limited to as few procediu-es 

as possible to assure integrity of the structure, thus only one process/request can be 
accessing and/or modifying the configuration structures at any one time. A 
read/write request will result in a list to be generated for the physical sectors 
involved. This list can only be generated after the sector range lock is executed. 
25 Once the list is generated, the configuration structures are not used, so they may be 

modified by other requests. The sector range lock assures that the physical sectors 
specified in the list cannot change position or be moved in the configuration. 
4) The configuration structure can be very dynamic, it must be saved across power off 
situations, and it must be able to survive failures of the controller as well as short 
30 power failures. 

Having thus described a presently preferred embodiment of the present invention, it will be 
understood by those skilled in the art that many changes in construction and circuitry and widely 
differing embodiments and applications of the invention will suggest themselves without departing 
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from the scope of the present invention as defined in the claims. The disclosures and the 
description herein are intended to be illustrative and are not in any sense limiting of the invention, 
defined in scope by the following claims. 
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CLAIMS 

What is claimed is 

1 I. Inanarrax o! s:oraircd.-xiccs(106, 110,112,114, 11 6) accessible from a host computer system 

2 (102), wherein dau s!.»r-.J by said host computer system (102) on each one of said storage devices 

3 (1 06, 1 1 0. 1 1 2. 1 : 4 I ; f . . IS distributed by said array across all storage devices in said array, a 

4 method of ori:ar.:/:ru s:..ra-c devices within said array, said method comprismg the steps of: 

5 (a) oy^ir.r: cau blocks on said storage devices into at least one block group (1902, 

6 •"►' I and a block pool (1908); 

7 (b) unr-i I uTinen to said array, allocating blocks from said block pool (1908) to 

8 ti rrr j i,-r of allocated blocks to store said data; « 

9 (c ) v^T>r- I not being written to said array, placing data from said allocated blocks 
irii W.J Lait one block group (1902, 1904, 1906). 



10 



1 2. The mcihivj . vo.crcin step (c) comprises the step of copying said data to said at least 

2 one block yroup !** : : *'-i.l906). 

1 3. The mcihiKi . r . i jirr i u herein step (c) comprises the step of exchanging blocks from said at 

2 least one block p-.>ur i *^ : !^^04, 1906) with said allocated blocks. 

1 4. The mcihixi . : . ijin i v. ncrrin a size of each of said at least one block group (1902, 1904, 1906) 

2 is dynamicali v cf ar. cj*- c ^^ a user of said method. 

1 5. The method t.: cu:rr. I v^ftcrcm a protection status of each of said at least one block group (1902, 

2 1904, 1906> IS d>namualiv changeable by a user of said method. 

1 6. The mcilKKl u: tlaim 1 wnerein data stored in said at least one block group (1902, 1904, 1906) 

2 is stored in a ilrst corfiiruraiion and data stored in said allocated blocks is stored in a second 

3 configuration. 

1 7. The method of claim c u herein said first configuration and said second configuration are 

2 selected by said arra) 

1 8. The method ol claim I wherein step (b) further comprises the steps of: 
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2 (bl) when said data being written is not located sequentially after data previously 

3 written, allocating more blocks than necessar>' to contain said data being written; 

4 and 

5 (bl) when said data being written is located sequentially after data previously written, 

6 writing said data into allocated blocks containing said previously written data. 

1 9. The method of claim 1 wherein step (b) further comprises the steps of: 

2 (bl ) when said data being written is no larger than an amount of data that can be stored on 

3 a single disk within said array, allocating an amount of said blocks sufiRcient to 

4 allow said data to be written twice, wherein said data is duplicated vsdthin said 

5 allocated blocks. . 

1 1 0. The method of claim 1 wherein step (b) further comprises the steps of: 

2 (bl ) when said data being written is an amount of data that requires writing to all disks of 

3 said array, writing said data in said at least one block group (1902, 1904, 1906), 

4 wherein no blocks are allocated from said block pool. 
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